Add --all-ranks CLI infrastructure and core processing logic #115

skarjala · 2025-07-01T21:43:57Z

Summary:

Implements processing of multiple rank log files via the new --all-ranks flag
Core processing logic: handle_all_ranks() function for multi-rank processing

Future work:

landing page HTML generation and template system integration

ezyang · 2025-07-02T00:58:09Z

If LLMs were used please post the prompts, thanks! You can use #114 as a model. (If not used that is fine too, I just want to check!)

StrongerXi

Dropped some nits on code dup, but I can see how the duplication might be intentional because all-rank processing might end up (or already is) diverging from one-rank processing, so you make the call:).

StrongerXi · 2025-07-07T16:33:48Z

src/cli.rs

@@ -92,11 +100,12 @@ fn main() -> anyhow::Result<()> {
        strict: cli.strict,
        strict_compile_id: cli.strict_compile_id,
        custom_parsers: Vec::new(),
-        custom_header_html: cli.custom_header_html,
+        custom_header_html: cli.custom_header_html.clone(),


Why clone now?

If I'm not mistaken, this is necessary because ParseConfig needs to own the string and cli is passed by reference.

The emphasis of the question is on "why now" rather than just "why": the code compiled and seemed to work before this change.

I think what happened is that, during your changes, cli was changed to a reference. It no longer is one in the current PR version, so this .clone() shouldn't be necessary here.

Just to clarify, in my current cli.rs file, I would need the clone in handle_one_rank() because handle_one_rank takes in &Cli and can't move the String out of a borrowed struct. Is this right?

The clone happens because you want to create ParseConfig using a Cli ref, but do you need 1 ParseConfig per rank?

Hint: Are there differences between the ParseConfig created for each rank?

Hmm since ParseConfig is the same for all ranks, I could create ParseConfig once before processing ranks. And then pass that to handle_one_rank instead of passing cli. This would avoid repeated cloning of custom_header_html.

Is this the direction you're suggesting?

Yes, that sounds reasonable
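
The direction agreed on in this thread could look roughly like the sketch below. The `Cli` and `ParseConfig` structs here are simplified, hypothetical stand-ins (the real types have more fields), and `build_config` is an illustrative helper name that does not appear in the PR. The point is only that the config is built once by consuming `cli`, so the header string is moved rather than cloned per rank.

```rust
use std::path::{Path, PathBuf};

// Hypothetical, trimmed-down stand-ins for the real `Cli` and `ParseConfig`.
struct Cli {
    strict: bool,
    custom_header_html: String,
}

struct ParseConfig {
    strict: bool,
    custom_header_html: String,
}

// Consumes `cli`, so the String is moved into the config instead of cloned.
fn build_config(cli: Cli) -> ParseConfig {
    ParseConfig {
        strict: cli.strict,
        custom_header_html: cli.custom_header_html,
    }
}

// Each rank only borrows the shared config.
fn handle_one_rank(rank_path: &Path, config: &ParseConfig) -> PathBuf {
    // Placeholder for the real parsing work; it just reads the config.
    let _ = (config.strict, &config.custom_header_html);
    rank_path.with_extension("html")
}

fn handle_all_ranks(cli: Cli, rank_paths: &[PathBuf]) -> Vec<PathBuf> {
    let config = build_config(cli); // built once, before the loop
    rank_paths
        .iter()
        .map(|p| handle_one_rank(p, &config))
        .collect()
}
```

Because `handle_one_rank` no longer sees `Cli` at all, the question of cloning out of a borrowed struct does not come up.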

xmfan · 2025-07-07T18:17:29Z

src/cli.rs

+
+        // Extract rank number from filename
+        let rank_num = if let Some(pos) = rank_name.find("rank_") {
+            let after_rank = &rank_name[pos + 5..];


take a look at str::strip_prefix

xmfan · 2025-07-07T18:20:06Z

src/cli.rs

+            let after_rank = &rank_name[pos + 5..];
+            after_rank
+                .chars()
+                .take_while(|c| c.is_ascii_digit())


what are the log file name patterns you need to match?

Currently it supports rank_N.log where N is a numeric variable. The logic uses strip_prefix("rank_") then extracts consecutive ASCII digits as the rank number. Ex:

rank_0.log, rank_1.log, rank_10.log

rank_0_worker.log

something_rank_5.log

But rejects files without the "rank_" prefix or without digits after it

I mean, do the compiler logs actually produce file names with that much variance? You only need to handle the file names logged via TORCH_TRACE, right?

You're right, I'll take out the unnecessary complexity.
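
For illustration, a stricter parse built on `str::strip_prefix` might look like the sketch below. It assumes the simple `rank_N.log` shape mentioned above; `parse_rank` is a hypothetical helper name, not a function from the PR.

```rust
/// Parses the rank from a name shaped exactly like `rank_<digits>.log`.
/// Any other shape, or a rank that overflows u32, yields `None`.
fn parse_rank(file_name: &str) -> Option<u32> {
    let digits = file_name.strip_prefix("rank_")?.strip_suffix(".log")?;
    // Reject empty or non-digit remainders such as "0_worker" or "+5".
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}
```

With this shape, `rank_10.log` is accepted, while `rank_0_worker.log` and `something_rank_5.log` are rejected instead of being matched by accident.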

xmfan · 2025-07-07T18:21:24Z

src/cli.rs

+        }
+
+        // Add link to this rank's page
+        rank_links.push((rank_num.clone(), format!("rank_{rank_num}/index.html")));


Don't want to hard code this here. Suppose someone changes the single rank codepath to change the directory where index.html is stored, then this line would break! Is there a way we could ensure that this line remains relevant even if the single rank logic changes?

Fixed! Now it uses the actual output path returned by handle_one_rank() instead of hardcoding "index.html".

I tested this by running both single rank and multi-rank processing with different output directories and verified that the generated links correctly point to the actual output files in their respective subdirectories, so that it will remain correct even if the single-rank output structure changes.
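
One way to keep the link tied to the single-rank logic is to derive it from the path that logic returns. The sketch below assumes `handle_one_rank` reports the main page it wrote; its body here is a hypothetical stand-in, and `rank_link` is an illustrative helper name.

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-in: the single-rank path reports where it wrote its main page.
fn handle_one_rank(rank_out_dir: &Path) -> PathBuf {
    rank_out_dir.join("index.html")
}

// Builds the landing-page link relative to the top-level output directory.
// Returns None if the page was written outside `out_dir`.
fn rank_link(out_dir: &Path, main_output: &Path) -> Option<String> {
    let relative = main_output.strip_prefix(out_dir).ok()?;
    Some(relative.to_string_lossy().into_owned())
}
```

If the single-rank code later writes its page somewhere else under the rank directory, the link follows automatically because nothing here names `index.html` except the stand-in.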

xmfan · 2025-07-09T22:15:20Z

src/cli.rs

 }

 fn main() -> anyhow::Result<()> {
    let cli = Cli::parse();
+
+    if cli.all_ranks {
+        return handle_all_ranks(&cli);


cli is no longer needed after handle_all_ranks, which means you can take it by value. In Rust, passing by value moves the argument unless its type implements the Copy trait (Cli doesn't).
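
A minimal sketch of the by-value version, using a hypothetical two-field `Cli` and a `run` function standing in for `main`:

```rust
// Hypothetical, trimmed-down stand-in for the real `Cli`.
struct Cli {
    all_ranks: bool,
    path: String,
}

// Taking `Cli` by value lets the function move fields out without cloning.
fn handle_all_ranks(cli: Cli) -> String {
    let dir = cli.path; // moved out of `cli`, no clone needed
    format!("all ranks in {dir}")
}

fn run(cli: Cli) -> String {
    if cli.all_ranks {
        // `cli` is moved here; this is fine because the branch returns.
        return handle_all_ranks(cli);
    }
    format!("single rank in {}", cli.path)
}
```

The borrow checker accepts the move inside the `if` because that branch returns, so `cli` is still intact on the single-rank path.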

xmfan

I would recommend splitting this PR up a bit. There's pre-existing code that can be refactored first, e.g. the PathBuf logic and a handle_one_rank helper. Then there can be another PR to add handle_all_ranks.

xmfan · 2025-07-10T18:59:21Z

src/cli.rs

+
+        // Add link to this rank's page using the actual output path from handle_one_rank
+        let rank_link = format!("rank_{}/{}", rank_num, main_output_path.display());
+        rank_links.push((rank_num.clone(), rank_link));


Suggested change

-        rank_links.push((rank_num.clone(), rank_link));
+        rank_links.push((rank_num, rank_link));

xmfan · 2025-07-10T19:15:56Z

src/cli.rs

+    rank_path: &PathBuf,
+    rank_out_dir: &PathBuf,
+    cli: &Cli,
+    create_output_dir: bool,


create_output_dir seems avoidable

Try rewriting your callsites, so that this function either always creates the output directory, or the callsite always creates the output directory before calling into this.

Hint: can you rewrite all your newly added directory creating logic using setup_output_directory?

Just to clarify, this would mean to remove the create_output_dir parameter and move directory creation to the callsites:

Single rank: uses existing setup_output_directory
Multi-rank: calls fs::create_dir(&rank_out_dir)? before handle_one_rank

Is this what you meant?

that, or you could always call setup_output_directory from within handle_one_rank
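
The second option (the helper always prepares its own directory) might be sketched as follows. The `setup_output_directory` body below is a hypothetical stand-in built on `std::fs::create_dir_all`; the repository's real function may take different arguments and do more.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical stand-in for the existing `setup_output_directory`.
fn setup_output_directory(out_dir: &Path) -> io::Result<()> {
    fs::create_dir_all(out_dir)
}

// No `create_output_dir` flag: the helper always prepares its output directory,
// so single-rank and multi-rank callers behave the same way.
fn handle_one_rank(rank_path: &Path, rank_out_dir: &Path) -> io::Result<PathBuf> {
    setup_output_directory(rank_out_dir)?;
    let main_output = rank_out_dir.join("index.html");
    // Placeholder for the real report generation.
    fs::write(&main_output, format!("parsed {}", rank_path.display()))?;
    Ok(main_output)
}
```

Taking `&Path` instead of `&PathBuf` is a small extra simplification in this sketch; a `&PathBuf` argument still coerces to it at the call site.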

xmfan · 2025-07-10T19:17:37Z

src/cli.rs

    let path = if cli.latest {
-        let input_path = cli.path;
+        let input_path = &cli.path;


doesn't look like this change is needed

xmfan · 2025-07-10T19:18:14Z

src/cli.rs

+    handle_one_rank(&path, &out_path, &cli, false)?;
+
+    if !cli.no_browser {
+        opener::open(out_path.join("index.html"))?;


what's MAIN_OUTPUT_FILENAME for?

xmfan · 2025-07-10T19:24:40Z

src/cli.rs

+            }
+
+            // Extract rank number from the pattern
+            let after_prefix = &filename[31..]; // Remove "dedicated_log_torch_trace_rank_"


strip_prefix?

look at line 225, looks very very similar. could we avoid computing things twice?

Maybe it's easier to express if you didn't use .filter here?

To avoid computing things twice, I'm intending to use filter_map to extract rank numbers once during collection, then returning Vec<(DirEntry, String)>. Is this approach feasible to address your feedback?

yeah that can work. I was just thinking of falling back to a plain for loop for this logic
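
As a rough sketch of the `filter_map` idea, the version below validates the name and extracts the rank in one pass. To stay self-contained it works on plain file names rather than `DirEntry` values, and `collect_rank_logs` is a hypothetical name; the file name shape is the `dedicated_log_torch_trace_rank_0_hash.log` pattern quoted in this PR.

```rust
const PREFIX: &str = "dedicated_log_torch_trace_rank_";

// Returns (file name, rank) pairs. Names that do not match the pattern, or
// whose rank does not parse as u32, are skipped.
fn collect_rank_logs<'a>(names: &[&'a str]) -> Vec<(&'a str, u32)> {
    names
        .iter()
        .filter_map(|&name| {
            let rest = name.strip_prefix(PREFIX)?.strip_suffix(".log")?;
            // The rank is everything up to the first '_' (the part before the hash).
            let (rank, _hash) = rest.split_once('_')?;
            Some((name, rank.parse::<u32>().ok()?))
        })
        .collect()
}
```

Since the rank travels with the entry, later code does not need to slice the file name a second time.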

xmfan · 2025-07-10T19:28:16Z

src/cli.rs

+            };
+
+            // Only support PyTorch TORCH_TRACE files: dedicated_log_torch_trace_rank_0_hash.log
+            if !filename.starts_with("dedicated_log_torch_trace_rank_")


https://doc.rust-lang.org/src/core/str/mod.rs.html#2400

Add --all-ranks CLI infrastructure and core processing logic

15a74df

facebook-github-bot added the cla signed label Jul 1, 2025

Merge branch 'pytorch:main' into pr1-core-logic

4dfe60e

skarjala requested review from xmfan, bdhirsh and StrongerXi July 3, 2025 21:45

skarjala marked this pull request as draft July 3, 2025 21:46

skarjala added 3 commits July 3, 2025 16:08

Fix Lint Error

cae8159

Update cli.rs

43e2cb3

Update cli.rs

4ba11f4

skarjala marked this pull request as ready for review July 4, 2025 03:10

StrongerXi reviewed Jul 7, 2025

xmfan reviewed Jul 7, 2025

xmfan reviewed Jul 7, 2025

xmfan reviewed Jul 7, 2025

xmfan reviewed Jul 7, 2025

skarjala added 3 commits July 7, 2025 12:03

Fix cli.rs to Address PR feedback

eaff4df

Update cli.rs

52d8260

Update cli.rs

8f4b1b6

skarjala marked this pull request as draft July 7, 2025 20:37

skarjala added 2 commits July 7, 2025 13:40

Update cli.rs to fix Lint error

5bbf21c

Update cli.rs

24c52cd

skarjala marked this pull request as ready for review July 7, 2025 20:51

skarjala requested review from xmfan and StrongerXi July 7, 2025 21:02

skarjala added 2 commits July 8, 2025 16:13

Update cli.rs to only allow files in TORCH_TRACE format

13f0f5c

Update cli.rs

03e06e0

xmfan reviewed Jul 9, 2025

Update cli.rs to fix Lint

a529cd2

xmfan reviewed Jul 9, 2025

Update cli.rs to fix clone & reference issues

44cd79e

xmfan reviewed Jul 10, 2025

xmfan reviewed Jul 10, 2025

skarjala marked this pull request as draft July 14, 2025 15:44
